CATEGORY PROCESSING OF QUERY
TOPICS AND ELECTRONIC DOCUMENT 
CONTENT TOPICS 

FIELD OF THE INVENTION

This invention relates to the field of electronic content 
provision. More specifically, it relates to gathering related 
content from internet and intranet sources and providing 
access to same in response to user requests. 

BACKGROUND OF THE INVENTION 

A huge quantity of information is being continuously 
created and made available via electronic communications 
systems. There is so much information that it is simply not 
possible for an individual person to read it all. On the other
hand, it is imperative that certain items of information reach
certain people. Much of the electronically-provided news
information ages rapidly, such that it loses its relevancy in 
a matter of days, or even a matter of hours (e.g., stock market 
information). Each person has different needs for 
information, and requires access to a different subset of the 
available information. In light of the foregoing, there is 
clearly a need for a system and method for rapidly accessing 
categorized electronic information. 

One aspect of the problem arises because the information 
is being created in many different places. News articles 
about events in the world or business community, and 
articles written for newspapers, magazines and journals, can
generally be obtained through various content providers, 
who frequently aggregate the information from a number of 
sources into single continuous electronic streams. No content
provider today, however, provides access to all available
information, so there is a trade-off between full access and
complexity. Moreover, an individual user is frequently 
forced to subscribe to a host of services in order to obtain the 
information which is generated from different sources, in 
different countries, and in various languages. Subscribing to 
many services to some extent negates the benefits realized 
by the content aggregation by providers, since the user must 
then often filter through multiple copies of the same docu- 
ments. 

Internally, organizations face similar issues. Memos,
announcements, documents of various kinds, and intranet 
web content are created at multiple locations throughout an 
organization, yet are generally not readily available to all 
members of the organization. Therefore, the process of 
collecting the information from all points of origins is a key 
issue, along with categorization and controlled dissemination
of that information.

Another aspect of the problem is the actual matching
process, comprising matching the collected and categorized 
content with an individual user's interests. For matching to 
work, an individual user must be able to express a diverse set 
of interests, not just one interest. A language of some kind 
is necessary to provide a medium for this expression of the 
user's interest. Further, a system is needed to capture the 
language and apply it to the items of information. Moreover, 
the language must embody some kind of high level semantic 
knowledge, since past word-search-based systems have 
fallen short of a satisfactory solution. The ability to express, 
capture and apply a person's interests or needs is a critical 
feature of the problem. 

Finally, there is a need to deliver the information to people 
who have expressed an interest. The primary requisites for 
delivery are making sure that access to the information is 
convenient, even in dynamic situations, and making sure 
that delivery can occur quickly once the information
becomes available. Moreover, people are increasingly 
mobile and have varied styles of working and of accessing 
and processing information. An effective delivery system 
will therefore require that the means of access be ubiquitous,
that multiple means of access be available, and that delays 
in making the information available be minimized. 

It is therefore an objective of the present invention to 
provide a system for gathering, categorizing, and delivering 
electronic content to users in response to user requests.

It is another objective of the invention to provide a system 
and method for gathering content from both inside (i.e., 
intranet) and outside (i.e., internet) sources and categorizing
same for provision in response to customized user requests. 

Yet another objective of the present invention is to pro- 
vide a customer with the ability to embed user interest and 
delivery mechanisms into customer applications.

SUMMARY OF THE INVENTION 

These and other objectives are realized by the present
invention which provides a method and system for categorizing
metadata about content provided via the internet or
intranet; for categorizing user query content; and for
matching and delivering categorized information tailored to
customized user profiles.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention will now be further detailed with specific 
reference to the appended Figures wherein:

FIG. 1 provides a schematic illustration of an implemen- 
tation of the present invention. 

FIG. 2 provides a schematic illustration of the intranet 
side of one embodiment of the inventive system. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

FIG. 1 provides a schematic overview of one implementation
of the present invention. The implementation can be viewed
as having two sides, an external side comprising the
sources, content providers and System Server, and an
internal side comprising the customer's site (including at
least one server for the customer's intranet), internal
sources and end user sites. As detailed therein, sources 11
provide electronic content (e.g., articles) on-line to
content providers 12. The System Server, 10, gathers
electronic content from content providers, as well as
directly from sources, if necessary. At the System Server,
the electronic content is categorized, with duplicate copies
eliminated, and is stored in so-called "channels" of
information. Each channel represents a particular category
or group of categories of related information. The
categorization of document content is generally done without
reference to known user profiles or prejudices, although the
categorization can be influenced by known or expected user
query categories.

On the so-called "internal" side of the inventive system,
the Customer Intranet Server 14, is in communication with
not only the outside System Server, but also with internal
sources 15 and at least one end user 16. The end users may
be employees of the customer or clients of the customer who
have contracted or otherwise arranged for receipt of
information which has been accumulated, categorized, and
disseminated from the Customer Intranet Server site 14.

An end user 16 will specify the areas of interest for which
that end user wishes to obtain electronic information. Unlike
prior art systems which allowed only minimal user query




US 6,182,066 B1



input, often limited to single word entries for simple word
searching, the present system assembles a complex user
query including the specification of multiple disparate topics
of interest. The user profile is created by system components
which are located at the Customer Intranet Server 14.
"Creation" of the user profile involves not only the extension
of user-input language, but also the elimination of
non-critical language, inclusion of semantic knowledge, and
cross-relating of user interest topics. Query development is
further detailed below. Once the user profile has been
developed, it is stored at the Customer Intranet Server for
matching to assembled and categorized content. The system
can be programmed to conduct on-going matching (i.e.,
checking every new document entry for a match with the
user profile), periodic matching (e.g., every 12 hours), or
matching only upon user prompting (e.g., only when a user
connects to the Customer and asks for an update).

Continual or periodic categorization of external electronic
content is the task of the system components which can be
located at the System Server 10. The System Server receives
input from the content providers 12, as well as possibly from
the internal sources 15 via the Customer Intranet Server 14.
Receipt of input from both external and internal sources can 
be a passive process, whereby the documents are continu- 
ously or periodically supplied to the System Server, or an 
active process, whereby system crawler components seek 
out the documents. 

The inventive system preferably includes provision to the
customer site of at least one internal crawler which will
provide a totally automated way to bring their entire
distributed network resources into the system. The crawlers
crawl through a customer's internal network and retrieve
documents from various sources, distinguished by the
technologies which were used to store the information.

Documents from the internal sources are assembled and
categorized at the Customer Intranet Server, where a Channel
Map is created containing a list of web servers, directories
and other targets which have been or are to be crawled.
A Channel Map can be constructed at the System Server as 
well. Each entry in the Channel Map may include a list of 
channels in which web pages and documents from the 
respective server and directory are to appear. Table 1 pro- 
vides a sample Channel Map for a fictitious semiconductor 
manufacturer: 



TYPE    SERVER          DIRECTORY                  CHANNELS

Web     HR              /publish/benefits/401k     401k
Web     HR              /publish/jobopenings       Jobs
Web     Marketing       /publish/product/specs     Product Specs
Web     www.badco.com   /pub/productspecs          Competition Specs
Web     www.goodco.com  /pub/products/electronic   Customer Products
PCFile  engineering     /projects/chipdesigns      Chip Designs
PCFile  marketing       /reports/companalysis      Competitive Anly.
FTP     engineering     /projects/status           Status Reports
Notes   engineering     /specs/i^jpsji^d           ^1200 Design
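
A Channel Map of this kind can be modeled as a list of entries keyed by source type, server, and directory. The Python sketch below is illustrative only: the `ChannelMapEntry` class, the `channels_for` function, the sample rows, and the directory-prefix matching rule are all assumptions made for the example, not details given in the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ChannelMapEntry:
    """One crawl target and the channels its documents appear in."""
    source_type: str            # e.g. "Web", "PCFile", "FTP", "Notes"
    server: str
    directory: str
    channels: Tuple[str, ...]


# A few rows modeled loosely on the sample Channel Map (hypothetical values).
CHANNEL_MAP = [
    ChannelMapEntry("Web", "HR", "/publish/benefits/401k", ("401k",)),
    ChannelMapEntry("Web", "HR", "/publish/jobopenings", ("Jobs",)),
    ChannelMapEntry("FTP", "engineering", "/projects/status", ("Status Reports",)),
]


def channels_for(source_type: str, server: str, path: str) -> List[str]:
    """Return the channels of every entry whose directory contains `path`."""
    found: List[str] = []
    for entry in CHANNEL_MAP:
        same_source = (entry.source_type == source_type
                       and entry.server == server)
        under_dir = (path == entry.directory
                     or path.startswith(entry.directory + "/"))
        if same_source and under_dir:
            found.extend(entry.channels)
    return found
```

Under these assumptions, a page crawled from `/publish/jobopenings/engineer.html` on the HR web server would be placed in the Jobs channel without any analysis of its content.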



FIG. 2 provides a schematic illustration of the sources 
accessible to the Customer Intranet Server of the fictitious 
company, directly or through the System Server, and the 
channels that result from receiving or crawling those 
sources. Information gathered from external sources will 
also be mapped to the established channels, so that an end
user can readily access all relevant information in a category 
or channel as the result of a single query. 

While some amount of categorization may be
straightforward, such as those above-noted examples
wherein any information obtained from a certain source will
necessarily be provided on a given channel (i.e., with sites
or site directories being mapped to the channels), the bulk of
document categorization requires intensive analysis of the
document contents. In addition to the crawlers which
automatically funnel documents obtained from certain sources
into pre-established channels, there are two other primary
means by which documents are categorized. The first, and 
most rudimentary, is categorization by manual user 
interface, whereby a system administrator (or even docu- 
ment author) identifies the document to be loaded into the 
server and identifies the channels in which the document is 
to appear. The second, more complex, means is automatic 
categorization by content filtering, which is conducted by 
system components located at either the Customer Intranet 
Server or the System Server 10, the details of which are 
further provided below and in co-pending applications, Ser.
No. 08/979,248, entitled "Method and System for Electronic
Document Content or Query Content Filtering", and Ser.
No. 08/980,075, entitled "Content Filtering for Electronic
Documents Generated in Multiple Foreign Languages",
which are assigned to the present assignee, and are being 
filed on even date herewith. Such automatic categorization 
can also be utilized at the Customer Intranet Server for the 
purpose of categorizing internal documents into channels, 
which may match or be unique from the channels provided 
by the System Server. Such channel definitions can be 
applied as well to documents received from the System 
Server to fill the customer-defined channels with news or 
other external documents. After query processing and docu- 
ment content categorization, it is preferable to analyze the 
categories to ascertain if other relationships exist among the 
categories, which relationships themselves may be identified 
as new categories or channels, which is the subject of the 
present invention. The foregoing co-pending applications 
are incorporated by reference herein, as is co-pending patent 
application Ser. No. 08/979,861 entitled "Method and System
for Providing Access to Categorized Information from
OnLine Internet and Intranet Sources," which is assigned to
the present assignee. 

Once documents from both the internal and external 
sources have been categorized/assigned channels, both the 
documents and the assigned channels are stored in a local
database at the Customer Intranet Server or associated 
customer location. Inventive components at the Customer 
Intranet Server match the channel assigned to each of the 
incoming documents with the user's interests as found in the 
user profile. Each document is then made available for
access by, or is sent to, the user whose interests it matches. 
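
The channel-to-profile matching step can be pictured as a set intersection between the channels assigned to a document and the channels named in a user profile. The sketch below is a simplification under stated assumptions: the `match_documents` function and its dictionary inputs are hypothetical, and profiles are reduced to plain sets of channel names, whereas the described system works with query vectors.

```python
from typing import Dict, List, Set


def match_documents(doc_channels: Dict[str, Set[str]],
                    profiles: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each user to the documents sharing at least one channel
    with that user's profile."""
    deliveries: Dict[str, List[str]] = {user: [] for user in profiles}
    for doc_id in sorted(doc_channels):
        for user, interests in profiles.items():
            # A non-empty intersection means the document is of interest.
            if doc_channels[doc_id] & interests:
                deliveries[user].append(doc_id)
    return deliveries
```

A document assigned to several channels is delivered once to each interested user, which is consistent with the goal of avoiding duplicate copies.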

The System Server's above-noted functions may be provided
as part of a customer intranet, wholly outside of the
customer domain, or divided in function between the two
locations. In the "outside" example, all document collection
and categorization would be done at the System Server as a 
service of the provider. Documents found on the external 
internet, as well as those which may be supplied from the 
customer's own intranet and/or databases, would be ana- 
lyzed and categorized at the provider location. In the 
instance where the customer wishes to additionally be a 
provider to end users, two alternative scenarios are possible. 
Under the first scenario, an outside provider would still 
assemble and categorize documents from outside sources 
and make them available at the customer's server. The 
customer's server would also be adapted to perform assem- 
bly and categorization of "in-house" documents, merging of 
the in-house assemblage with the categorized documents 
from outside sources, matching the resultant merged documents
to user request profiles, and disseminating the matching
results to the user. The second alternative implementation
would locate all categorization functionality at the
customer location. In all three implementations, the
customer location would retain the capability for receipt of user
request input, creation and storage of the user profile, 
matching of the user profile to the categories or channels into 
which the documents are placed, and provision of the 
matched documents for end user review. 

The customer site is provided with the capability for 
building applications to create a series of different user 
interfaces with different interaction means, different restric- 
tions for user access (e.g., providing some users access to 
only documents from outside sources, while others would 
have access to both externally-obtained and internally- 
generated documents), and different levels of query and 
content complexity. 

For the detailed descriptions of the processing "stages,"
including user query analysis and profile creation, document
categorization, and matching, it is to be noted that the same
types of analyses can frequently be applied at each stage. For
example, finding relationships between two seemingly
disparate user query subject categories can parallel the effort to
identify commonality of subject matter from two input
documents, as well as a subsequent effort to match the
profile to a category/channel. Therefore, where appropriate,
the ensuing processes will reference one, two or all of profile
analysis, document content categorization, and matching
stages.

Users of the system initially specify which topics are of
interest. This specification takes the form of a simple
subscription to pre-defined user categories, a modified
subscription whereby the user can alter or add to the pre-defined user
categories, a user-customized set of queries, or any
combination of the foregoing. Each query represents a topic, and
can additionally contain boolean, fuzzy, proximity and/or
hierarchical operators. A set of topics preferred by a user is
known as a user profile. The present method reduces each
query to one or more vector entries with the entry's index
into the vector corresponding to a hash of the query's textual
expression of the importance of that query to the overall
topic/profile. A query can be either a single token (word or
phrase) or a combination of tokens which includes boolean,
fuzzy, proximity and/or hierarchical operators. Token IDs
are assigned to each query item as hereinafter detailed.

Automatic query processing, as well as document content
categorization, is optimized in the present invention by first
tokenizing the content thereof. In such a tokenization
process, all the words/phrases are first identified as units, then
stemmed. After all stop words and phrases are filtered out,
only a few of the original words/phrases are left. These
surviving words/phrases are called tokens. The tokens are
usually just the stems of the original words, or made-up
labels which correspond to phrases. The stems or made-up
labels are referred to as "terms". Terms are strings, and since
the system must handle quite a few thousand terms, the total
memory which can be consumed by terms could take up a
significant amount of computer memory. Therefore, a hash
function is provided to assign unique token IDs to the terms
(which may also consist of expressions containing words
and phrases as terms combined with a variety of query
operations) found in the documents and queries. The term
strings are replaced by 32 bit integers. A "reverse dictionary"
can be maintained which comprises a lexicon with token IDs
as the keys and the words, phrases, queries as the values.
However, if the need is to mark the document with
categories, and not to catalog and retrieve based on the
specific tokens matched, a lexicon will not be needed.
Clearly, when comparisons are being made, comparisons of
32 bit integers will be significantly faster than the prior art
string comparisons. Textual messages are likewise mapped
to vectors using the same procedures as were used for the
topics, above. All vectors are then normalized. Classification
and matching are thereby reduced to vector processing.

Query processing suffers from the drawback that, even
with tokenizing and vectorizing, a great deal of redundancy
may be contained in large query sets. The redundancy
increases CPU and memory consumption requirements for
any of the categorization processes based on the query sets.
Query processing can be streamlined by recognizing
possible hierarchical relations between queries in a set that has
been previously indexed and vectorized, some of which may
correspond to known topic categories or channels. In order
to streamline the query processing, after vectorization, the
following steps are implemented:

First, one calculates the cosine measure (dot product) of
every query vector against every other vector. This will
provide a similarity measure of every query against every
other query in the database. The system stores all similarity
measures that equal or exceed a pre-set threshold in a sparse
matrix. Those query vectors having similarity measures with
scores below the threshold are assumed to have nothing in
common, and therefore, are assigned an implicit similarity
measure equal to zero.

Standard clustering methods are applied to the sparse
matrix of similarity measures. Applying a second threshold,
the clusters are divided into two groups comprising (a)
clusters of vectors whose similarity measures exceed or
equal the second threshold; and (b) clusters of vectors
whose similarity measures do not exceed the second
threshold. Membership in group (a) or (b) is determined by
comparison to a predefined similarity threshold, say, for
example, 60%. Thus, those queries in a cluster that share
greater than or equal to 60% of their tokens belong to group
(a), while those that don't belong to group (b). The
difference between groups (a) and (b) is that the queries in (b) are
not as strongly related as those in (a).

The query vectors in group (a) share most of their terms.
When shown a cluster of such queries, the information
analyst must ask the following questions: "Are these queries
related to one another?"; "If they are related, are they part of
the same branch or related branches in the topic hierarchy?";
"If they are not currently related to one another, should they
be related?"; "If so, what is the best way to relate them?";
and, "If they should not be related, what is the best way to
avoid this clustering (overlap) of queries and of
discriminating between them in the future?".

The queries in groups (a) and (b) may indicate to the
information analyst: new links between previously unrelated
pre-existing categories (forming new hierarchy branches); a
strengthening in the links between previously related
categories (consolidation and strengthening of existing
hierarchy branches); or, new links between pre-existing categories
and entirely new categories (again, forming new hierarchy
branches). If any bogus links are discovered between totally
unrelated queries, those links must be avoided by refining/
enhancing the queries.
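
The tokenizing, hashing, vectorizing, and thresholded cosine steps described above can be sketched in a few lines of Python. This is a minimal illustration under loudly stated assumptions, not the patented implementation: the stop-word list and the suffix-stripping `stem` function are toy placeholders, CRC-32 (`zlib.crc32`) stands in for the unspecified hash function and, unlike the "unique token IDs" the text calls for, can collide, and all function names are hypothetical.

```python
import math
import re
import zlib
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is"}


def stem(word):
    """Toy suffix-stripping stemmer (an assumption, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def token_ids(text):
    """Tokenize, drop stop words, stem, and hash each term to 32 bits."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = [stem(w) for w in words if w not in STOP_WORDS]
    return [zlib.crc32(t.encode("utf-8")) for t in terms]


def to_vector(text):
    """Sparse unit-length vector: token ID -> normalized term count."""
    counts = Counter(token_ids(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tid: c / norm for tid, c in counts.items()} if norm else {}


def cosine(u, v):
    """For normalized vectors the dot product is the cosine measure."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tid, 0.0) for tid, w in u.items())


def similarity_matrix(vectors, threshold):
    """Store only pairs at or above threshold; absent pairs imply zero."""
    sparse = {}
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            sparse[(i, j)] = score
    return sparse
```

Representing the sparse matrix as a dictionary keyed by index pairs keeps memory proportional to the number of related pairs rather than to the square of the number of queries, which is the point of discarding sub-threshold scores.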

Using the results of the clustering process, the
information analyst can be presented with a list of queries in each
cluster in group (a) or (b) and decide whether the queries
truly have anything in common or not. If the queries already
belong in branches of the same hierarchy, their tags will
make this fact obvious, and the analyst may skip further
analysis. If, on the other hand, the tags do not show any
relationship between the queries, the analyst may decide that
further analysis of the individual queries is required.
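
The description asks only for "standard clustering methods" and does not name one. As one simple stand-in, the sketch below groups items into connected components (single-link clustering) and then splits the clusters into groups (a) and (b) by a second threshold. The function names, the dictionary input (index pair mapped to similarity score), and the rule that a cluster is judged by its weakest stored link are all assumptions made for illustration.

```python
def cluster(n_items, sparse):
    """Group items joined by any stored similarity (connected components)."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j) in sparse:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    # Items with no stored similarity to anything are not clusters.
    return [members for members in groups.values() if len(members) > 1]


def split_by_strength(clusters, sparse, second_threshold=0.6):
    """Group (a): weakest stored pair meets the second threshold; else (b)."""
    group_a, group_b = [], []
    for members in clusters:
        scores = [s for (i, j), s in sparse.items()
                  if i in members and j in members]
        if min(scores) >= second_threshold:
            group_a.append(members)
        else:
            group_b.append(members)
    return group_a, group_b
```

The lists returned for groups (a) and (b) are what an analyst would be shown for review; the decision about whether the members truly belong together remains a human one.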

Finally, if two or more queries in a cluster have quite a 
few things in common, the common terms can be made into 
a single query vector and this query vector can be replaced 
by a single term in all of the original queries. This single
replacement term corresponds to and represents the new
query vector. The foregoing procedure will reduce the
amount of redundancy in the system. Of course, the infor- 
mation analyst must first decide that the terms involved are 
truly common to all the queries and that those terms will 
likely remain common throughout the life of the queries 
before consolidating them into a common query vector. 
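
The consolidation step can be illustrated with queries reduced to plain lists of terms. In the sketch below, the `consolidate` function, the query names, and the replacement term are hypothetical, and the analyst's approval described above is assumed to have already been given.

```python
def consolidate(queries, new_term):
    """Factor the terms shared by every query into one common query and
    substitute `new_term` for them in each original query."""
    common = set.intersection(*(set(terms) for terms in queries.values()))
    if not common:
        # Nothing is shared, so the queries are returned unchanged.
        return dict(queries), None
    rewritten = {name: sorted(set(terms) - common) + [new_term]
                 for name, terms in queries.items()}
    return rewritten, sorted(common)
```

After consolidation the shared terms are stored once, in the common query, and each original query carries only its distinguishing terms plus the single replacement term.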

In a similar manner, vector clustering can be utilized to
automatically find new topic categories among the document
content categories. Again assuming that all documents have
been pre-indexed and converted into normalized vectors,
one calculates the cosine measure (dot product) of every
document vector against every other document vector. This
will provide a similarity measure of every document against
every other document in the database. The system stores
those similarity measures that equal or exceed a preset
threshold in a sparse matrix. Those document vectors
having similarity measures with scores below the threshold
are assumed to have nothing in common, and therefore are
assigned an implicit similarity measure equal to zero.

Once again, standard clustering methods are applied to
the sparse matrix of similarity measures. Each cluster
produced will fall into one of the following groups: (A)
document vectors in the cluster mostly share common
pre-existing category tags; (B) document vectors in the cluster
share some common pre-existing tag categories; (C)
document vectors in the cluster share no category tags. The
documents in group (A) closely match well known,
pre-existing categories. Thus, at first sight, they hold little
interest to information analysts. But, on further analysis, the
analyst may find that much can be learned from this group.
For example, the analyst may ask: how closely related are
these pre-existing categories that they show up in almost
every document in the group?

The documents in groups (A) and (B) may indicate: new
links between previously unrelated pre-existing categories; a
strengthening in the links between previously related
categories; or, new links between pre-existing categories and
entirely new categories. The documents in group (C)
indicate the existence of previously unknown categories and of
links between them. This is the most important category for
the information analyst.

For each cluster found, and for each group of documents
of type (A) and (B) within that cluster, and for each matched
category within a group, a summary vector is calculated
from all the document vectors in the group matching that
category. A summary vector is a single vector that best
represents a cluster of neighboring document vectors. It
represents the average vector in the cluster and is calculated
by taking the centroid of all of the vectors in the cluster. This
summary vector is refined by a process of comparison with
other summary vectors. The final, refined summary vector
represents the query which will retrieve those documents
with the highest recall and precision possible when issued
against the corpus of documents stored in the document
database. This new query vector is then compared to the one
associated with the original category, to determine to what
extent they match. After this process, if there is enough of an
overlap, an improved version of the original query vector
associated with the category is produced. If, on the other
hand, they only partially overlap, a new category and query
is produced together with an improved version of the
original query vector associated with the original category.
The new link between them is implicit in the terms which
both query vectors share.

For each cluster found, and for each group of documents
of type (C) within that cluster, a summary vector is
calculated from all the document vectors in the group. This
summary vector is refined by the process of comparison with
other summary vectors. The final, refined summary vector
represents the query which will retrieve those documents
with the highest recall and precision possible when issued
against the corpus of documents stored in the document
database. This query also represents a new category.

Every new category requires a new tag/label that best
represents that category. This tag is put together initially by
concatenating the most representative (i.e., highest score)
terms of the query associated with the category. A quick scan
of the text of documents retrieved by the query will locate
those representative terms in their original context, will
determine if those terms are part of any collocations which
are not part of any term yet, and, if so, will replace those
terms in the label with their most common collocations. The
final label is scanned by a small parser specializing in noun
and verb phrases as may appear in a category label, to make
sure that it is syntactically correct. Refined categories have
their labels/tags enhanced by a process identical to the one
described immediately above.
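
The centroid calculation and the first-draft label can be sketched as follows. The sketch rests on several assumptions: vectors are keyed by readable term strings rather than 32 bit token IDs (a reverse dictionary would supply the strings in practice), the centroid is re-normalized to unit length, and the `summary_vector` and `initial_label` names and the `top_n` parameter are hypothetical. The later refinement, collocation replacement, and parsing steps are not shown.

```python
import math


def summary_vector(vectors):
    """Centroid of sparse vectors (term -> weight), re-normalized."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    centroid = {t: w / len(vectors) for t, w in total.items()}
    norm = math.sqrt(sum(w * w for w in centroid.values()))
    return {t: w / norm for t, w in centroid.items()} if norm else centroid


def initial_label(summary, top_n=3):
    """Concatenate the highest-scoring terms as a first-draft tag."""
    # Sort by descending weight; break ties alphabetically for stability.
    ranked = sorted(summary.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(term for term, _ in ranked[:top_n])
```

Because the centroid averages every vector in the cluster, terms that appear in most member documents dominate the summary and therefore the draft label.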

The invention has been described with reference to sev- 
eral specific embodiments. One having skill in the relevant 
art will recognize that modifications may be made without 
departing from the spirit and scope of the invention as set 
forth in the appended claims. 

Having thus described our invention, what we claim as 
new and desire to secure by Letters Patent is: 

1. A method for categorizing electronic document content 
of a plurality of documents for matching to user requests 
comprising the steps of: 

parsing said document content into a plurality of items, 
each of said items comprising a contiguous phrase of 
more than two words located within said document; 

assigning each of said plurality of items at least one of a 
plurality of token IDs; 

vectorizing said plurality of token IDs into a plurality of 
document vectors; 

calculating the cosine measure of each of said document 
vectors against each other of said document vectors to 
provide a plurality of similarity measures, one similar- 
ity measure for each document against each other of 
said plurality of documents. 

2. The method of claim 1 further comprising the steps of: 
comparing each of said similarity measures to a pre-set
threshold.

3. The method of claim 2 further comprising storing each 
of said similarity measures which exceeds said pre-set 
threshold in a sparse matrix. 

4. The method of claim 3 further comprising clustering 
said stored similarity measures in a plurality of clusters 
according to said cosine measures. 

5. The method of claim 4 further comprising calculating 
a summary vector for each of said plurality of clusters. 

6. The method of claim 5 further comprising the steps of: 
identifying said summary vector as representing a new
category for said documents in said cluster; and
creating a new category tag for said documents in said 
cluster. 
7. A method for categorizing user input query content for
matching user requests to electronic document content com- 
prising the steps of: 

parsing said query content into a plurality of items, each
of said items comprising a contiguous phrase of more 
than two words located within said document; 

assigning each of said plurality of items at least one of a 
plurality of token IDs; 

vectorizing said plurality of token IDs into a plurality of 
query vectors; 

calculating the cosine measure of each of said query 
vectors against each other of said query vectors to 
provide a plurality of similarity measures, one similar- 
ity measure for each query against each other of said 
plurality of queries. 

8. The method of claim 7 further comprising the steps of: 
comparing each of said similarity measures to a pre-set
threshold.

9. The method of claim 8 further comprising storing each 
of said similarity measures which exceeds said pre-set 
threshold in a sparse matrix. 

10. The method of claim 9 further comprising clustering 
said stored similarity measures in a plurality of clusters 
according to said cosine measures. 

11. The method of claim 10 further comprising calculating 
a summary vector for each of said plurality of clusters. 

12. The method of claim 11 further comprising the steps of:

identifying said summary vector as representing a new 
category for said queries in said cluster;

graphically presenting said clusters for human analysis;
and

creating a new category tag for said queries in said cluster. 
13. A method for categorizing electronic document content
of a plurality of documents for matching to user requests
comprising the steps of: 

parsing said document content into a plurality of items, 
each of said items comprising one of a word or a
contiguous phrase of words located within said docu- 
ment; 

assigning to said plurality of items at least one of a 
plurality of token IDs, said token IDs representing a 
plurality of items;

vectorizing said plurality of token IDs into a plurality of 
document vectors; 

calculating the cosine measure of each of said document 
vectors against each other of said document vectors to 
provide a plurality of similarity measures, one similarity
measure for each document against each other of
said plurality of documents. 

14. A method for categorizing user input query content for 
matching user requests to electronic document content com- 
prising the steps of: 

parsing said query content into a plurality of items, each 
of said items comprising one of a word or a contiguous 
phrase of words located within said document; 

assigning to said plurality of items at least one of a
plurality of token IDs, each of said token IDs
representing a plurality of items;

vectorizing said plurality of token IDs into a plurality of 
query vectors; and 

calculating the cosine measure of each of said query 
vectors against each other of said query vectors to 
provide a plurality of similarity measures, one similar- 
ity measure for each query against each other of said 
plurality of queries. 

* * * * *